Database connection: First we connect to the database with pymysql. The connection is handled by a Database class that we wrote in database.py.
In [1]:
from database import Database
database = Database(
    '<host name>',
    '<database name>',
    '<user name>',
    '<password>',
    'utf8mb4'
)
connection = database.connect_with_pymysql()
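The Database class itself is not shown in this notebook. As a rough idea of what such a wrapper might look like, here is a minimal sketch. The attribute names, the lazy import, and the choice to return None on failure (suggested only by the later `if connection:` check) are all assumptions. Only `pymysql.connect` and its `host`, `user`, `password`, `database` and `charset` keyword arguments come from the PyMySQL library.

```python
class Database:
    """Hypothetical connection wrapper; the real database.py may differ."""

    def __init__(self, host, db_name, user, password, charset="utf8mb4"):
        # Argument order follows the call shown in the notebook cell.
        self.host = host
        self.db_name = db_name
        self.user = user
        self.password = password
        self.charset = charset

    def connect_with_pymysql(self):
        """Return an open pymysql connection, or None if connecting fails."""
        # Imported lazily so the class can be defined without the driver installed.
        try:
            import pymysql
        except ImportError:
            return None
        try:
            return pymysql.connect(
                host=self.host,
                user=self.user,
                password=self.password,
                database=self.db_name,
                charset=self.charset,
            )
        except pymysql.MySQLError as error:
            print("Could not connect: {}".format(error))
            return None
```

Returning None rather than raising keeps the calling cell simple, at the cost of hiding the reason for the failure from the caller.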
Next we read the questions from the database and clean them. The cleaning consists of the following steps:
Step 1: Decode the data and remove special characters
Step 2: Remove punctuation marks
Step 3: Remove extra whitespace
We then write the cleaned text back to the database and, finally, close the database connection.
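To make steps 2 and 3 concrete, here is a minimal sketch using only the standard library. The function names are hypothetical and are not the actual Cleaner methods from preprocessor.py, whose implementation is not shown. Replacing punctuation with a space (rather than deleting it) is also an assumption, as is limiting it to ASCII punctuation. Step 1 is left out because the notebook does not say how the data is encoded.

```python
import re
import string

# Translation table mapping every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def remove_punctuation(text):
    """Replace ASCII punctuation with spaces so adjacent words stay separated."""
    return text.translate(_PUNCT_TO_SPACE)


def remove_extra_whitespace(text):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def clean(text):
    """Apply step 2 (punctuation) and then step 3 (whitespace)."""
    return remove_extra_whitespace(remove_punctuation(text))


print(clean("How do I   connect,\n to MySQL?"))  # How do I connect to MySQL
```

Running the whitespace step last matters here, because replacing punctuation with spaces can itself create runs of spaces.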
In [2]:
from preprocessor import Decoder, Cleaner

# decoder instance
decoder = Decoder()

if connection:
    try:
        with connection.cursor() as cursor:
            # example: decode all questions
            for data in decoder.decode_in_range(cursor, 'questions', 'body', 1, 99478):
                # skip rows that are missing or contain an empty field
                if data and all(data):
                    try:
                        # example: punctuation removal (data[0] is the id, data[1] the body)
                        cleaned_data = Cleaner.punctuation_remover(data[1])
                        # example: whitespace removal
                        cleaned_data = Cleaner.whitespace_remover(cleaned_data)
                        # pass the values as parameters so the driver quotes them
                        sql = "UPDATE questions SET body=%s WHERE id=%s"
                        cursor.execute(sql, (cleaned_data, data[0]))
                        connection.commit()
                    except Exception:
                        print("Exception in updating id " + str(data[0]))
    finally:
        connection.close()
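When text is written back with an UPDATE, passing the values as query parameters lets the driver handle quoting, so a quote character in the text cannot break the statement. The sketch below illustrates this with the standard library's sqlite3 as a stand-in for MySQL, so that it runs without a server. Note that sqlite3 uses `?` placeholders where pymysql uses `%s`. The table contents and the sample text are made up for illustration.

```python
import sqlite3

# In-memory stand-in for the questions table; the real table lives in MySQL.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE questions (id INTEGER PRIMARY KEY, body TEXT)")
connection.execute("INSERT INTO questions (id, body) VALUES (1, 'raw text')")

# Sample text containing an apostrophe, which would end a hand-built SQL string early.
cleaned_body = "What's the difference between a list and a tuple"

# The values travel separately from the SQL text, so the apostrophe is harmless.
connection.execute(
    "UPDATE questions SET body = ? WHERE id = ?",
    (cleaned_body, 1),
)
connection.commit()

stored_body = connection.execute(
    "SELECT body FROM questions WHERE id = ?", (1,)
).fetchone()[0]
print(stored_body)
connection.close()
```

With pymysql the equivalent call would be `cursor.execute("UPDATE questions SET body=%s WHERE id=%s", (cleaned_body, 1))`.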
The database now contains the cleaned data, on which we carry out the rest of our analysis.